Abstract. In this paper we introduce a framework for computing upper bounds
on the WCET that are both safe and accurate, for hardware platforms with
caches and pipelines. The methodology we propose consists of 3 steps: 1) given
a program to analyse, compute an equivalent (WCET-wise) abstract program;
2) build a timed game by composing this abstract program with a network of
timed automata modeling the architecture; and 3) compute the WCET as the
optimal time to reach a winning state in this game. We demonstrate the
applicability of our framework on standard benchmarks for an ARM9 processor
with instruction and data caches, and compute the WCET with UPPAAL-TiGA.
We also show that this framework can easily be extended to take into account
dynamic changes in the speed of the processor during program execution.



1 Introduction 

Embedded real-time systems are composed of a set of tasks (software) that run
on a given architecture (hardware). These systems are subject to strict timing
constraints, which must be enforced by a scheduler. Designing an effective
scheduler is possible only if some bounds are known on the execution time of
each task. For simple scheduling algorithms, e.g., non-preemptive ones,
knowledge of the worst-case execution-time (WCET) is sufficient to design a
scheduler. For more complex scheduling algorithms with preemption or shared
resources, however, the WCET of each task might not yield the WCET of the
entire system. This is why most critical embedded systems rely on a rather
simple scheduling algorithm. Performance-wise, determining tight bounds for
the WCET is crucial: with rough over-estimates, a set of tasks might be
wrongly declared non-schedulable, or a lot of computation time might be
wasted in idle cycles, with a corresponding loss of energy/power.

The WCET Problem. The execution time, $\mathrm{time}(p, d, H)$, of a program
$p$ with input data $d$ on the hardware $H$ is measured as the number of
cycles of the fastest component of the hardware, i.e., the processor. Data
take their values in a finite domain $\mathcal{D}$. The program is given in
binary code or equivalently in the assembly language of the target
processor^1. The worst-case execution-time of program $p$ on hardware $H$ is
defined by:

$$\mathrm{WCET}(p, H) = \sup_{d \in \mathcal{D}} \mathrm{time}(p, d, H).$$

The WCET problem asks the following: Given p and H, compute WCET(p, H). 

In general, the WCET problem is undecidable because otherwise we could
solve the halting problem^2. However, for programs that always terminate and
have a bounded number of paths, it is (theoretically) computable: the
possible runs of the program can be represented by a finite tree.
Notice that this does not mean that the problem is tractable.
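As an illustration only, the following Python sketch computes a WCET by brute force, as the maximum of the execution time over every input in a finite domain. The toy program `run_time`, its per-iteration cycle costs and the function name `wcet` are assumptions made for this example; they are not part of the framework described in this paper.

```python
from itertools import product


def run_time(data):
    """Toy 'program': cycles used by a linear scan that stops at the first zero."""
    cycles = 0
    for x in data:
        cycles += 2          # assumed cost of the load and comparison
        if x == 0:
            break
        cycles += 1          # assumed cost of the loop-back branch
    return cycles


def wcet(domain, n_inputs):
    """Maximum of run_time over all input vectors in domain^n_inputs."""
    return max(run_time(d) for d in product(domain, repeat=n_inputs))


print(wcet(range(3), 4))
```

The enumeration visits `len(domain) ** n_inputs` inputs, which is exactly why the problem is computable in this setting yet not tractable in general.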

If the input data are known or the program execution time is independent
from the input data, the tree contains a single path and it is usually feasible 
to compute the WCET. Likewise, if we can determine some input data that 
produces the WCET (this might be as difficult as computing the WCET), we 
can compute the WCET on a single-path program. 

It is not often the case that the input data are known or that we can determine
an input that produces the WCET. Rather the (values of the) input data are 
unknown, and the number of paths to be explored might be extremely large: 
for instance, for a Bubble Sort program with 100 data to be sorted, the tree 
representing all the runs of the (assembly) program on all the possible input 
data has more than 2^° nodes. Although symbolic methods (e.g., using BDDs) 
can be applied to analyse some programs with a huge number of states, they will 
fail to compute the exact WCET on Bubble Sort by exploring all the possible 
paths. 

Another difficulty of the WCET problem stems from the increasingly complex
architectures that embedded real-time systems run on. They usually feature a
multi-stage pipeline and a fast memory component like a cache, both of which
influence the WCET in a complicated manner. It is thus a challenging problem
to determine a precise WCET even for relatively small programs running on
complex architectures.

Methods and Tools for the WCET Problem. The reader is referred to [1]
for an exhaustive presentation of the WCET computation techniques and tools.
There are two main classes of methods for computing WCET.

— Testing-based methods. These methods are based on experiments i.e., run- 
ning the program on some data, using a simulator of the hardware or the real 
platform. The execution time of an experiment is measured and, on a large 
set of experiments, a maximal and minimal bound can be obtained. The 

^1 When we refer to the "source" code, we assume the program p was generated by a
compiler, and refer to the high-level program (e.g., in C) that was compiled into p.

^2 Note this is true even for input data ranging over a finite domain, and can be proved
using König's Lemma.
maximal bound computed this way is unsafe as not all the possible paths
have been explored. These methods might not be suitable for safety critical
embedded systems but they are versatile and rather easy to implement.
RapiTime [2] (based on pWCET [3]) and Mtime [4] are measurement tools
that implement this technique.
— Verification-based methods. These methods often rely on the computation 
of an abstract graph, the control flow graph (CFG), and an abstract model 
of the hardware. Together with a static analysis tool they can be combined 
to compute WCET. The CFG should produce a super set of the set of all 
feasible paths. Thus the largest execution time on the abstract program is 
an upper bound of the WCET. Such methods produce safe WCET, but are 
difficult to implement. Moreover, the abstract program can be extremely 
large and beyond the scope of any analysis. In this case, a solution is to take 
an even more abstract program which results in drifting further away from 
the exact WCET. 

Although difficult to implement, there are quite a lot of tools implementing
this scheme: Bound-T [5], OTAWA [6], TuBound [7], Chronos [8], SWEET [9]
and aiT [10,11] are static analysis-based tools for computing WCET.

The verification-based tools mentioned above rely on the construction of a
control flow graph, and the determination of loop bounds. Loop bounds can be
provided as user annotations (in the source code) or are sometimes inferred
automatically. The CFG is also annotated with some timing information about
the cache misses/hits and pipeline stalls, and path analysis is carried out
on this model, e.g., by Integer Linear Programming (ILP). The algorithms
implemented in the tools use both the program and the hardware specification
to compute the CFG fed to the ILP solver. The architecture of the tools
themselves is thus monolithic: it is not easy to adapt an algorithm for a new
processor. This is witnessed by the WCET'08 Challenge Report [12], which
highlights the difficulties encountered by the participants in adapting their
tools to the new hardware in a reasonable amount of time.

WCET and Model-Checking. Surprisingly enough, only a few tools use
model-checking techniques to compute WCET. Considering that (i) modern
architectures are composed of concurrent components (the stages of the
pipeline, caches) and (ii) these components synchronize and synchronization
depends on timing constraints (time to execute in one stage of the pipeline,
time to fetch data from the cache), formal models like timed automata [13]
and state-of-the-art real-time model-checkers like UPPAAL [14,15] appear
well-suited to address the WCET problem.

It has previously been claimed [16] that model-checking was not adequate to
compute WCET, but this statement has since been revised. In [17], A. Metzner
showed that model-checkers could well be used to compute safe WCET on the
CFG for programs running on pipelined processors with an instruction cache.

In [18], B. Hubert and M. Schoeberl consider Java programs and compare
ILP-based techniques with model-checking techniques using the model-checker
UPPAAL. Model-checking techniques seem slower but easily amenable to
changes (in the hardware model). The recommendation is to use ILP tools for
large programs and model-checking tools for code fragments.

More recently, the TASM toolset [19] (M. Ouimet & K. Lundqvist) has been
used to compute WCET with UPPAAL: the TASM machine is a high level
machine featuring neither pipelining nor caches, and computing the WCET
amounts to finding the longest path (timewise) in a timed automaton that
specifies a task.

Another use of timed automata (TA) and the model-checker UPPAAL for
computing WCET on pipelined processors with caches is reported in [20]. The
framework METAMOC described in [20] (A. E. Dalsgard et al.) consists in: 1)
computing a flow graph (FG) from a binary program, 2) composing this FG
with a (network of timed automata) model of the processor and the caches.
Computing the WCET is then reduced to a safety (or dually a reachability)
property AG (Time < k) (reads "on all paths, the variable Time, global time, is
less than k") that can be checked with UPPAAL.

The previous framework is extremely elegant yet has some shortcomings.
Out of the 15 programs^3 of the Mälardalen University benchmarks only 7 can be
analysed with a concrete instruction and data cache (Table 6.1, page 84 in [20]).
It is also surprising that some single-path programs could not be analysed with
concrete caches. The tool chain relies on a value analysis tool which fails on 3 of
the 15 programs. It requires a specialised version of UPPAAL (not available) to
avoid a binary search for computing the WCET.

Our Contribution. In this paper we use timed game automata (TGA) and
UPPAAL-TiGA [21] (UPPAAL for timed games) to compute WCET. We model
the WCET problem as a two-player timed game. Intuitively Player 1 is the 
program, and Player 2 is in charge of deciding the outcome of the comparison 
instructions (e.g., cmp, tst which set the branching conditions) that depend on 
the input data. As the choice of the input data is not controllable by Player 1, 
we obtain a two-player game. The problem we solve on this game is an optimal 
time reachability problem: 

"What is the optimal time for Player 1 to reach the end of the program ?" 

What is similar to the previously mentioned approach [20] (A. E. Dalsgard et
al.) is the timed automata models for the cache^4 and pipeline stages, i.e., the
model of the architecture, but we use a totally different model for the program.
We propose a new and very compact encoding of the program and pipeline
stages' states which enables us to compute the WCET for 13 out of the previous
15 programs (see Table 1, page 28). Moreover, compared to METAMOC that
uses a computer with 32GB RAM, we can compute the results on a laptop

^3 The benchmarks contain 35 programs. In [20], only 14 programs can be analysed
with a concrete instruction cache and 7 with a concrete instruction and data cache.
^4 Note that a similar model is reportedly due to A. P. Ravn in [18].
computer (2 GHz Dual Core, 2 GB RAM) within a few seconds. Using timed games
instead of timed automata is also a major difference: the on-the-fly algorithm [22]
implemented in UPPAAL-TiGA is different from the one running in UPPAAL,
and it can also compute the optimal time (in the presence of an adversary) to
reach a designated state. Thus we do not need to do a binary search or use a
tailored version of UPPAAL to compute the results.
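To illustrate what "optimal time in the presence of an adversary" means on a finite, acyclic game, here is a minimal Python sketch. It is neither the on-the-fly algorithm of [22] nor the UPPAAL-TiGA input format; the tuple encoding of the nodes and the durations are assumptions made for this example. Since Player 1 (the program) has no choice of its own, the time needed to enforce termination is the largest time over the branch outcomes Player 2 may pick.

```python
def optimal_time(node):
    """Time Player 1 needs to guarantee reaching 'end', whatever Player 2 does."""
    kind = node[0]
    if kind == "end":
        return 0
    if kind == "step":                  # ("step", duration, successor)
        _, duration, successor = node
        return duration + optimal_time(successor)
    if kind == "branch":                # ("branch", [successors]); Player 2 chooses
        _, successors = node
        return max(optimal_time(s) for s in successors)
    raise ValueError(f"unknown node kind: {kind}")


# Hypothetical program: a 2-cycle comparison, then a 5-cycle or a 1-cycle branch.
game = ("step", 2, ("branch", [("step", 5, ("end",)), ("step", 1, ("end",))]))
print(optimal_time(game))
```

A single backward pass over the tree returns the value directly, so no binary search over a candidate bound k is needed in this toy setting.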

We also show that taking into account processor speed variations is easy in
our framework. This can be important as it is possible to adjust the speed of the
processor depending on the program to be run. For some programs, the saved
power can be up to 22% (see Table 1).

The advantages of our approach are manifold (METAMOC [20] shares 1-3):

1. it is very easy to implement as it consists of two separate and independent
phases: 1) computation of a model of the program to be analysed; this only
requires a (formal) semantics of the assembly language of the target
processor^5; 2) computation of the WCET with UPPAAL-TiGA and the models
for the caches and pipelines, which specify the timing features. A model of a
cache (e.g., always miss or FIFO) can be substituted by changing the cache
component only (no need to recompute the model obtained in phase 1).

2. the design of the models for pipeline stages and caches can be stressed by
simulating some simple sample programs; this enables us to gain more
confidence in the model of the hardware, as it is not hidden in the analysis
algorithm; this is especially important for concurrent architectures like
pipelined processors that can be hard to describe;

3. UPPAAL or UPPAAL-TiGA can be used to simulate the program on the 
architecture. It is thus a quick way of obtaining a simulator for a given 
hardware; 

4. we do not require annotations. Instead, we run a simulation of the program 
with some given bounds on the number of branching or a maximal number 
of states. If too many branchings are encountered, the user is required to 
provide a constraint for the corresponding instruction in the program to 
remove some infeasible paths; 

5. we solve an optimal time reachability problem on the program p of the form: 
"what is the optimal time to enforce termination of program p ?" . This 
at once 1) proves that p terminates on every input data, and 2) computes 
the WCET. This could not be achieved in METAMOC [20] as the UPPAAL 
model contains priorities and deadlock freedom cannot be checked on models 
with priorities: thus if the safety property AG (Time < k) is satisfied, it does 
not mean that no deadlocks occurred; the deadlocks could be due to a flaw 
in the design of the pipeline model but in any case, it does not give a safe 
bound for the WCET as deadlocks have not been excluded. 

6. it is easy to add power related constraints in the model e.g., processor speed 
variations; 

^5 In contrast, the verification-based tools would need a description of the hardware to
compute the CFG.
7. we also show that not every program instruction is worth simulating and
that the effect of some instructions can be safely abstracted away. For
example, in the Fibonacci program, the content of the variable holding the
result is irrelevant for the computation of the WCET: it does not influence
any branching node. We show how to check that an abstract program is
equivalent to a concrete one and exemplify this on some of the benchmarks
from Mälardalen University.

Outline of the Paper. In Section 2 we briefly introduce the ARM9 architecture
and the assumptions we make on the assembly programs to be analysed.
Section 3 describes how to encode an assembly program with non-deterministic
choices into a game. In Section 4 we give the timed automata models of the
architecture we use to compute the WCET. Section 5 gives an overview of the
tool chain we propose and the components (compiler) we have designed together
with some comments on the case studies presented in Table 1.

2 Concrete and Abstract Programs 

Program, Registers, Memory. A program $p$ is a list of instructions
$p = i_1, i_2, \dots, i_k$ and $i_1$ is the initial instruction. The control
usually goes from instruction $i_j$ to $i_{j+1}$, except for branching
instructions, which give the next instruction $i_{j'}$ to be performed. Each
instruction performs some basic operations (arithmetic, logic, memory load or
store, branching) and has a duration which gives the amount of time it takes
in each stage of the pipeline of the processor^6. We assume the duration is
independent of the content of the operands of the instruction^7. In the
sequel we use the variable $\iota$ to denote an instruction of $p$.

The hardware on which $p$ runs has a pool of registers (different from the
main memory and the caches). We let $\mathcal{R} = \{r_0, \dots, r_k\}$ be the
set of registers. For example on the ARM9 [23] processor there are 16
registers. A designated register pc contains the program counter and points
to the next instruction to be performed (register 15 on the ARM9).

We let $\mathcal{M} = \{m_1, m_2, \dots, m_n\}$ be the set of memory cells'
addresses used by the program (we assume the program can access
$\mathcal{M}$). The content of the memory cells and registers is in a finite
domain $\mathcal{D}$ (e.g., 32 bit integers).

Semantics. When program p runs on input data d, it generates a computation 
that changes the values of the registers and memory cells. 

A state (of the computation of $p$) is given by a mapping
$v : \mathcal{R} \cup \mathcal{M} \to \mathcal{D}$ and we let $\mathcal{V}$ be
the set of states.

^6 A particular case is a processor with one stage.
^7 This is not always the case as for instance the duration of the instruction mull
(multiplication on long integers) on the ARM9 depends on how large one of the
operands is. However, we can always take the longest duration to obtain a safe upper
bound of the WCET.
Performing an instruction results in a state change, and is deterministic.
Given an instruction $\iota$ ($\iota$ includes the operands and can be thought
of as the code of an assembly instruction), the semantics of $\iota$ is a
mapping $[\![\iota]\!] : \mathcal{V} \to \mathcal{V}$.

As the program counter is in one of the registers, the semantics of a program
$p$ is completely determined by the current state of the computation. From a
state $v$, the next state (in the computation of $p$) is $v'$ and we denote
this $v \to v'$. $v'$ is given by $v' = [\![\iota]\!](v)$ where
$\iota = i_{pc}$ (we use pc both for the register and the content of this
register to avoid heavy notation).

For branching instructions, the control is determined by the status bits and
we assume they are also part of the pc register.

Remark 1. We assume pc is incremented by 1 after each instruction (except for 
branching instruction). In an actual computer, it is incremented by the word size 
but these details are irrelevant at this stage. 



Side Effects of an Instruction. Each instruction reads from and writes to
some subset of registers. We let $\mathit{regR}(\iota)$ (resp.
$\mathit{regW}(\iota)$) be the set of "read from" (resp. "written to")
registers for instruction $\iota$.

Each instruction can also read or write main memory cells. We let
$\mathit{memR}(\iota)$ (resp. $\mathit{memW}(\iota)$) be the set of memory
cell addresses read from (resp. written to) by instruction $\iota$.

An example of an assembly program is given in Listing 1.1. This program
performs a binary search on an array of 14 elements. Line 24 loads register
r3 with a value of the array at address $v(r4) + (v(r2) \times 8)$. As we do
not know the values of the array, the value of r3 is unknown after this
instruction. r0 contains the value we are looking for, and is also
unknown^8. As a consequence, the outcome of the comparison at line 2c is
undetermined, as the value of r3 is unknown. The outcome of the comparison is
used later in conditional instructions (e.g., ldreq r5, [r1, #4] and
subgt ip, r2, #1) and in the branching instruction beq 44. Two status bits
are needed to encode the result of the comparison at line 2c: whether r3 is
"lower or equal" than r0 and whether r3 is "equal" to r0. This is indicated
by the two predicates eq and le between / ... /. The address of the memory
cell referenced at line 24 is determined by the previous outcomes of the
comparison instruction at line 2c.

    00000000 <main>:
       0:  e3a00009   mov    r0, #9  ; 0x9
       4:  eaffffff   b      8 <binary_search>

    00000008 <binary_search>:
       8:  e92d4030   stmdb  sp!, {r4, r5, lr}
       c:  e59f4040   ldr    r4, [pc, #64]
      10:  e3a0e000   mov    lr, #0  ; 0x0
      14:  e3a0c00e   mov    ip, #14 ; 0xe
      18:  e3e05000   mvn    r5, #0  ; 0x0
      1c:  e08e300c   add    r3, lr, ip
      20:  e1a020c3   mov    r2, r3, asr #1
      24:  e7943182   ldr    r3, [r4, r2, lsl #3]
      28:  e0841182   add    r1, r4, r2, lsl #3
      2c:  e1530000   cmp    r3, r0  / eq le /
      30:  05915004   ldreq  r5, [r1, #4]
      34:  024ec001   subeq  ip, lr, #1 ; 0x1
      38:  0a000001   beq    44 <binary_search+0x3c>
      3c:  c242c001   subgt  ip, r2, #1 ; 0x1
      40:  d282e001   addle  lr, r2, #1 ; 0x1
      44:  e15e000c   cmp    lr, ip  / le /
      48:  c1a00005   movgt  r0, r5
      4c:  dafffff2   ble    1c <binary_search+0x14>
      50:  e8bd8030   ldmia  sp!, {r4, r5, pc}
      54:  00000158   andeq  r0, r0, r8, asr r1

Listing 1.1. Binary Search Program



^8 In the actual program it is 9 but it does not change the execution tree of the program.
Runs. A run of program $p$ from state $v_0$ (initial value of the input data)
is the (unique) sequence of instructions performed by $p$ from $v_0$:

$$\rho(p, v_0) = \iota_1 \cdots \iota_k \cdots \iota_n$$

with $\iota_1 = i_1$. The length of the run $\rho(p, v_0)$ is
$|\rho(p, v_0)| = n$. We assume that every run terminates, and that moreover,
given $p$, there exists a constant $K_p$ s.t.
$\forall v_0 \in \mathcal{V}, |\rho(p, v_0)| \leq K_p$. Intuitively, this
means that all loops are bounded, and it implies that no run encounters the
same state twice.

The state after the subsequence $\iota_1 \cdots \iota_k$ is determined by the
composition of the semantics functions of each instruction. If $v_j$ is the
state after instruction $\iota_j$ then
$v_{j+1} = [\![\iota_{j+1}]\!](v_j)$, and $v_0$ is the initial state.

Execution Time of a Run. If each instruction was performed one after the 
other, the execution-time of a run would be the sum of the execution times of 
each instruction. 

On pipelined architectures with caches, the execution-time solely depends on: 

1. the subsequences of instructions: pipeline stalls can occur, for instance be- 
cause one instruction (e.g., in the execute stage) reads a register written to 
by the instruction in the next stage (e.g., memory stage). 

2. the time to read or write a memory cell: instructions that require memory
transfers (load and store) might take different durations if a cache is used,
depending on whether the memory cell is already in the cache or not.

We let $H$ denote the architecture of the system. $H$ refers to the pipeline
structure and timing specifications, the cache initial state, size,
replacement policy and timing specifications, and the timing specifications
of the main memory. The execution-time of a run $\rho$ is completely
determined by:

— the architecture $H$,

— the duration of each instruction of $\rho$ in each stage of the pipeline,

— the registers read from and written to, and memory cells read from or written
to by each instruction of $\rho$.

The duration of a run $\rho$ on architecture $H$ is denoted
$\mathrm{time}_H(\rho)$. This function might be rather complex but is
nevertheless well-defined.

To formalize the previous informal definition, assume the architecture $H$ is
fixed. Let $\rho = \iota_1 \cdots \iota_n$ and
$\rho' = \iota'_1 \cdots \iota'_n$ be two runs of program $p$. We say that
$\rho$ and $\rho'$ are (time-wise) $H$-equivalent and write
$\rho \sim_H \rho'$ if for each $1 \leq k \leq n$:

— the duration of $\iota_k$ in each stage of the pipeline is the same as the
duration of $\iota'_k$;

— the registers used as operands and memory cells referenced are also the same:
$\varphi(\iota_k) = \varphi(\iota'_k)$ for
$\varphi \in \{\mathit{regR}, \mathit{regW}, \mathit{memR}, \mathit{memW}\}$.

Fact 1. If $\rho \sim_H \rho'$ then $\mathrm{time}_H(\rho) = \mathrm{time}_H(\rho')$.
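The definition of H-equivalence can be sketched in Python as a position-wise comparison of "timing signatures". The `Instr` record, its field layout and the sample per-stage durations are assumptions made for this illustration; the paper itself models instructions as timed automata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    durations: tuple               # assumed: cycles spent in each pipeline stage
    regR: frozenset = frozenset()  # registers read from
    regW: frozenset = frozenset()  # registers written to
    memR: frozenset = frozenset()  # memory addresses read from
    memW: frozenset = frozenset()  # memory addresses written to


def timing_signature(instr):
    """Everything the execution time may depend on; the name is left out."""
    return (instr.durations, instr.regR, instr.regW, instr.memR, instr.memW)


def h_equivalent(run1, run2):
    """True if both runs have the same length and matching signatures at each index."""
    return len(run1) == len(run2) and all(
        timing_signature(a) == timing_signature(b) for a, b in zip(run1, run2)
    )
```

Two runs that differ only in the values they compute (for instance a `mov` and a no-effect replacement with the same durations and register sets) compare equal, whereas a change in a referenced memory address does not.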

The worst-case execution-time for program $p$ on architecture $H$ is given by:
$$\mathrm{WCET}(p, H) = \max_{v_0 \in \mathcal{V}} \mathrm{time}_H(\rho(p, v_0)).$$
Timing Anomalies. Timing anomalies [1] can occur because of the complex
architecture of the hardware $H$. The term refers to counter-intuitive
observations in the sense that larger local execution-times may not result in
larger global execution-times. Pre-fetching instructions can lead to such
observations on some processors. This can also be observed on complex
pipeline architectures (e.g., out-of-order execution of instructions).

On architectures that do not exhibit timing anomalies, the function
$\mathrm{time}_H$ is in some sense monotonic.

For instance an architecture with an "always miss" cache (or equivalently
no cache) will produce a WCET which is never smaller than on an architecture
$H$ with a cache of size more than 1. As we consider worst-case
execution-time, a random replacement policy for a cache is equivalent to an
"always miss" cache. Let $H_r$ denote a cache with random replacement policy,
$H$ a regular cache (LRU, FIFO, semi-random replacement policy), and $H'$ the
same architecture with an "always miss" cache. The following holds:

Fact 2. $\mathrm{WCET}(p, H) \leq \mathrm{WCET}(p, H_r) \leq \mathrm{WCET}(p, H')$.

This implies that an over-approximation of $\mathrm{WCET}(p, H)$ can always be
obtained using an equivalent architecture $H'$ with an "always miss" cache.
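The effect of replacing a cache by an "always miss" cache can be observed on a single address trace with the following Python sketch. The hit and miss latencies, the fully associative FIFO organisation and the function names are assumptions made for this example; it compares memory access times only and ignores the pipeline.

```python
from collections import deque

HIT, MISS = 1, 10   # assumed latencies, in cycles


def fifo_time(trace, size):
    """Total access time of an address trace on a FIFO cache with `size` lines."""
    cache = deque()
    total = 0
    for addr in trace:
        if addr in cache:
            total += HIT
        else:
            total += MISS
            if len(cache) == size:
                cache.popleft()      # evict the oldest line
            cache.append(addr)
    return total


def always_miss_time(trace):
    """Total access time when every access goes to main memory."""
    return MISS * len(trace)


trace = [0, 8, 0, 8, 16, 0]
print(fifo_time(trace, 2), always_miss_time(trace))
```

Because every access costs at most `MISS` cycles in this model, the "always miss" time bounds the FIFO time from above for every trace, which is the over-approximation argument in miniature.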

The same remark applies for the pipeline of architecture $H$. If $H'$ is the
same as $H$ with larger durations for each instruction at each stage, then
$\mathrm{WCET}(p, H) \leq \mathrm{WCET}(p, H')$. If a pipeline stall in $H$
implies a pipeline stall in $H'$ for every program and every input data, then
$\mathrm{WCET}(p, H) \leq \mathrm{WCET}(p, H')$.

Another interesting case is when a branch instruction is executed. If it is 
not a loop, the program fragment has a diamond shape; both branches join at 
some future point in the computation. If the local worst-case execution time is 
obtained by taking one side of the branch instruction, we can safely ignore the 
other side as it does not contribute (more) to the global worst-case execution- 
time. 

The framework of this paper does handle timing anomalies, but some
abstractions defined below are not safe for architectures exhibiting timing
anomalies.

Abstractions. In this section we introduce some simple abstractions that can
be made on a program $p$. The aim of these abstractions is to reduce the space
needed to encode the state of the computation. We exemplify the usefulness of
these abstractions on some benchmark programs from Mälardalen University.
Listing 1.2 (Fig. 1) gives a C function computing the n-th Fibonacci number.

Its assembly language version is given in Listing 1.3. The control flow of the
assembly version is controlled by lines 20, 24 and 30: register r2 contains the
loop variable i and is incremented at each round. Lines c, 10, 1c, 28 and 2c
do not contribute to the program control flow. If we are only interested in the
execution-time of this program, their effects can be safely abstracted away. We
can replace them by equivalent instructions that modify only the pc register,
with the same read/written registers (and memory cells if it happens to be a
load/store instruction). For instance, instruction mov at line c can be replaced
by an abstract instruction $\mathrm{mov}^a$ with:
    int fib(int n)
    {
        int i, Fnew, Fold, temp, ans;
        Fnew = 1; Fold = 0;
        for (i = 2; i <= 30 && i <= n; i++)
        {
            temp = Fnew;
            Fnew = Fnew + Fold;
            Fold = temp;
        }
        ans = Fnew;
        return ans;
    }

Listing 1.2. C Program

     0:  mov    r2, #2  ; 0x2
     4:  cmp    r2, r0
     8:  mov    ip, r0
     c:  mov    r0, #1  ; 0x1
    10:  mov    r1, #0  ; 0x0
    14:  movgt  pc, lr
    18:  add    r2, r2, #1 ; 0x1
    1c:  mov    r3, r0
    20:  cmp    r2, #30 ; 0x1e
    24:  cmple  r2, ip
    28:  add    r0, r0, r1
    2c:  mov    r1, r3
    30:  ble    18 <fib+0x18>
    34:  mov    pc, lr

Listing 1.3. Assembly Code

Fig. 1. Fibonacci Program.



— $[\![\mathrm{mov}^a]\!](v) = v'$ with $v'(r) = v(r)$ for each register $r$
different from pc and $v'(pc) = v(pc) + 1$;

— the duration of $\mathrm{mov}^a$ in each stage of the pipeline is the same
as that of mov;

— the registers read from/written to by $\mathrm{mov}^a$ at line c are the
same as the ones read from/written to by instruction mov at line c.

In the end, we can abstract away the values of registers r0, r1 and r3 and
assume they never change, as no abstract instruction will modify them. The
WCET of the abstracted program will be exactly the same as the concrete one.

The goal of this abstraction is to reduce the space needed to encode a state 
of the computation. Instead of encoding 7 registers, only 4 are relevant for the 
computation of the WCET. 
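The claim that the abstracted Fibonacci program follows the same path as the concrete one can be checked with a small Python interpreter for the assembly of Listing 1.3. This is a sketch under stated assumptions: the function `run_fib`, the single integer `diff` standing for the status bits, and the use of byte addresses (pc + 4) are choices made for this illustration. With `abstract=True`, the instructions at lines c, 10, 1c, 28 and 2c only advance pc.

```python
def run_fib(n, abstract=False):
    """Return the list of addresses executed by the code of Listing 1.3 on input n."""
    r = {"r0": n, "r1": 0, "r2": 0, "r3": 0, "ip": 0}
    diff = 0                      # left operand minus right operand of the last cmp
    pc, trace = 0x00, []
    while True:
        trace.append(pc)
        nxt = pc + 4
        if pc == 0x00:
            r["r2"] = 2
        elif pc == 0x04:
            diff = r["r2"] - r["r0"]
        elif pc == 0x08:
            r["ip"] = r["r0"]
        elif pc == 0x0C and not abstract:
            r["r0"] = 1
        elif pc == 0x10 and not abstract:
            r["r1"] = 0
        elif pc == 0x14 and diff > 0:       # movgt pc, lr: return
            return trace
        elif pc == 0x18:
            r["r2"] += 1
        elif pc == 0x1C and not abstract:
            r["r3"] = r["r0"]
        elif pc == 0x20:
            diff = r["r2"] - 30
        elif pc == 0x24 and diff <= 0:      # cmple: compare only if "le" holds
            diff = r["r2"] - r["ip"]
        elif pc == 0x28 and not abstract:
            r["r0"] += r["r1"]
        elif pc == 0x2C and not abstract:
            r["r1"] = r["r3"]
        elif pc == 0x30 and diff <= 0:      # ble 18
            nxt = 0x18
        elif pc == 0x34:                    # mov pc, lr: return
            return trace
        pc = nxt


print(all(run_fib(n) == run_fib(n, abstract=True) for n in range(40)))
```

Since this program references no memory cell, identical address traces mean that the concrete and abstract runs execute the same instructions with the same register sets, so only r2, ip, pc and the status bits need to be tracked.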

A valid abstract program must simulate the execution tree of the concrete
program. To be equivalent WCET-wise to the concrete program, it should also
preserve the addresses of the referenced memory cells to ensure that cache
hits/misses are preserved.

To formalize the previous notions, we first define critical instructions. A
critical instruction is an instruction that:

(i) either sets some status bits; it can be a comparison or test (e.g., cmp, tst) or
an arithmetic instruction with the "s" flag on the ARM9 (e.g., a subtraction
subs r2, r2, #1);

(ii) or references a memory cell, e.g., ldr r0, [r2, r3, lsl #2]
(load register r0 with the content of memory cell r2 + (r3 x 4)).

Next we define abstract instructions. As exemplified for the mov instruction at
line c previously, given an instruction $\iota$, the abstracted instruction
$\iota^a$ is defined by:

— the semantics of $\iota^a$ is $[\![\iota^a]\!](v) = v'$ with
$v'(x) = v(x)$ for each register $x$ different from pc and each memory cell
$x$ in $\mathcal{M}$, and $v'(pc) = v(pc) + 1$;

— the duration of $\iota^a$ in each stage of the pipeline is the same as the
duration of $\iota$;

— the registers and memory cells read from/written to by $\iota^a$ are the
same as the ones read from/written to by instruction $\iota$:
$\varphi(\iota) = \varphi(\iota^a)$ for
$\varphi \in \{\mathit{regR}, \mathit{regW}, \mathit{memR}, \mathit{memW}\}$.

Let $p^a = i_1^a \cdots i_n^a$ be the abstract program that corresponds to
$p = i_1 \cdots i_n$. An abstraction mapping $\alpha$ is a mapping that
associates with each (concrete) instruction $\iota$ of $p$ either $\iota$
(identity) or $\iota^a$ ($\alpha$ determines whether $\iota$ is abstracted or
not). We write $\iota^\alpha$ for $\alpha(\iota)$.

Let $\rho(p, v_0) = \iota_1 \iota_2 \cdots \iota_k$ be a run of $p$ from $v_0$
and $\rho(p^\alpha, v_0) = \iota_1^\alpha \iota_2^\alpha \cdots \iota_k^\alpha$
the corresponding $\alpha$-abstracted run. Let
$I_c(p, v_0) \subseteq \{1, 2, \dots, k\}$ be the set of indices s.t.
$j + 1 \in I_c(p, v_0) \iff \iota_{j+1}$ is a critical instruction in
$\rho(p, v_0)$. Let $v_j$ be the state after executing instruction $j$ in
$\rho(p, v_0)$ and $v_j^\alpha$ be the state after executing abstract
instruction $j$ in $\rho(p^\alpha, v_0)$.

The following Lemma states that, if the values of the registers read
from/written to by any critical instruction (in $\rho(p, v_0)$) are equal to
the values of the same registers in the abstract execution, the execution
times of the concrete and abstract runs are the same.

Lemma 1. If $\forall j + 1 \in I_c(p, v_0)$, $v_j(r) = v_j^\alpha(r)$ for each
$r \in \mathit{regR}(\iota_{j+1}) \cup \mathit{regW}(\iota_{j+1})$, then
$\mathrm{time}_H(\rho(p, v_0)) = \mathrm{time}_H(\rho(p^\alpha, v_0))$.

Proof. If the values of the operand registers of each critical instruction
$\iota_j$ are the same in the concrete and abstract runs before performing
$\iota_j$ and $\iota_j^\alpha$, then:

1. the status bits that are set by the critical instruction have the same values
in the concrete and abstract state;

2. the addresses of the memory cells referenced by the instruction are the same
in the concrete and abstract run.

The concrete and abstract runs are thus $H$-equivalent, i.e.,
$\rho(p, v_0) \sim_H \rho(p^\alpha, v_0)$. By Fact 1 it follows that
$\mathrm{time}_H(\rho(p, v_0)) = \mathrm{time}_H(\rho(p^\alpha, v_0))$. □

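If the hypothesis of Lemma 1 holds for each run $\rho(p, v_0)$, whatever the
initial state $v_0$, we say that $p$ and $p^\alpha$ are $H$-equivalent and
write $p \sim_H p^\alpha$. In this case, by definition of the WCET, we have:

Lemma 2. If $p \sim_H p^\alpha$ then $\mathrm{WCET}(p, H) = \mathrm{WCET}(p^\alpha, H)$.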

Context Independence. As we cannot simulate $p$ for every input data, we
assume that the initial values of these data can be arbitrarily chosen. To
formalize this, we use an extended domain for the values of the registers and
memory cells: $\mathcal{D} \cup \{\bot\}$ where $\bot$ is a special unknown
value. At the beginning of the computation, every register (except pc) and
memory cell has its value set to $\bot$. The initial state is thus $v_0$ with
$v_0(x) = \bot$ for $x \in (\mathcal{R} \setminus \{pc\}) \cup \mathcal{M}$
and $v_0(pc) = i_1$ where $i_1$ is the address of the first instruction of
program $p$.

We assume that for each program p, the addresses of the memory cells ref- 
erenced during the course of the execution of the program, only depend on the 
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current state and are independent from the input data values. By this, we mean 
that the address referenced at each point in a run of a program is determined by 
some registers vahies that are known. These values may depend on the actual 
content of some memory cells because they influence the branching instructions, 
but once a branch is chosen, the addresses can be computed. An example is a 
binary search program: we have to determine wether a sorted array v contains 
a value s. The search continues as long as s has not been found. 

The semantics of each instruction (next state) is extended to the extended
domain $V \cup \{\bot\}$ as follows:

— for arithmetic and logical instructions, the value of the result of an
instruction is $\bot$ if the value of one of the operands is $\bot$;

— for instructions that set the status bits, there might be more than one next
state; if one operand is $\bot$, the next states are given by all the possible values
of the status bits;

— for memory transfer instructions (load, store with addresses in $\mathcal{M}$) the result
in memory or register is always $\bot$. Nevertheless, for transfers involving the
stack (a subset of the addresses in $\mathcal{M}$), we keep track of the values pushed
or popped. The stack is quite often used on call/return of a function, and
abstracting the content of the stack would result in some infeasible paths,
or even in references to forbidden memory cells;

— for branching instructions, there is one next state determined by the value
of the target (unconditional branching) or by the status bits (conditional
branching).
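The first two rules above can be illustrated by a minimal sketch in C. It is not taken from the tool: the encoding of the unknown value as INT_MIN and the helper names add_ext and cmp_le_ext are assumptions made for this example only.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Assumption: UNKNOWN plays the role of the special unknown value;
   INT_MIN is chosen here only for illustration and must never occur
   as an actual register value. */
#define UNKNOWN INT_MIN

/* Arithmetic on the extended domain: the result is unknown as soon
   as one operand is unknown. */
int add_ext(int a, int b) {
    if (a == UNKNOWN || b == UNKNOWN) return UNKNOWN;
    return a + b;
}

/* Hypothetical helper for a comparison that sets a single predicate
   ("a <= b"). It writes the possible outcomes into out[] and returns
   how many next states there are (1 if determined, 2 otherwise). */
int cmp_le_ext(int a, int b, bool out[2]) {
    assert(out != 0);
    if (a == UNKNOWN || b == UNKNOWN) {
        out[0] = false;   /* both outcomes are possible */
        out[1] = true;
        return 2;
    }
    out[0] = (a <= b);
    return 1;
}
```

With known operands the comparison has a single next state; with an unknown operand both values of the predicate have to be explored.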

From the previous extended definitions, there might be more than one run from
the initial extended state $v_0$. We denote by $p_\bot$ the non-deterministic program that
corresponds to $p$ on the extended domain. The semantics of $p_\bot$ is a tree, $tree(p_\bot)$,
where the branches correspond to the choices of the status bits when required.
Note that this tree might be unbounded.

An important property of this tree is that if $\rho(p, v_0)$ is a run of $p$ on input
data $v_0$, there is a path $\rho'$ in $tree(p_\bot)$ that satisfies $\rho(p, v_0) \approx_H \rho'$. Moreover,
as we assume that the number of steps when running $p$ is bounded by $K_p$, we
can safely truncate the tree $tree(p_\bot)$ and prune all nodes that are more than $K_p$
steps apart from the root. Let $Runs(p_\bot)$ denote the set of rooted paths in the
tree $tree(p_\bot)$. We assume $tree(p_\bot)$ has depth at most $K_p$. Let

$$WCET(p_\bot, H) = \max_{\rho \in Runs(p_\bot)} time_H(\rho).$$

As every run of $p$ is simulated by a run of $p_\bot$, we have:

$$WCET(p, H) \le WCET(p_\bot, H).$$
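The maximisation over the rooted paths of the truncated tree can be sketched as follows. This is only an illustration under a strong simplifying assumption: each step is given a fixed cost, whereas in our setting the time of a run is determined by the hardware model. The type node and the function wcet_tree are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node of the tree: each node carries the cost (in
   cycles) of one step and has at most two successors (the two
   possible values of one status predicate). */
struct node {
    int cost;
    const struct node *child[2];   /* NULL when absent */
};

/* Maximal accumulated cost over all rooted paths, ignoring every node
   that is more than `depth` steps away from the root. */
int wcet_tree(const struct node *n, int depth) {
    assert(n == NULL || n->cost >= 0);
    if (n == NULL || depth < 0) return 0;
    int best = 0;
    for (int i = 0; i < 2; i++) {
        int c = wcet_tree(n->child[i], depth - 1);
        if (c > best) best = c;
    }
    return n->cost + best;
}
```

Calling wcet_tree(root, Kp) corresponds to pruning all nodes that are more than Kp steps apart from the root before taking the maximum.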

Moreover, we can also define an abstract version, $p_\bot^\alpha$, of $p_\bot$, given an
abstraction mapping $\alpha$. The definitions are extended to the extended domain. As before
we have:

Lemma 3. If $p_\bot \approx_H p_\bot^\alpha$, then $WCET(p_\bot, H) = WCET(p_\bot^\alpha, H)$.

Combining Lemma 5 and Lemma 3 we have:

Lemma 4. If $p_\bot \approx_H p_\bot^\alpha$, then $WCET(p, H) \le WCET(p_\bot^\alpha, H)$.

Checking that $p_\bot^\alpha \approx_H p_\bot$. Checking whether $p_\bot \approx_H p_\bot^\alpha$ can be done by building
a synchronized product of $p_\bot$ and $p_\bot^\alpha$ and checking whether each state preceding
a critical instruction satisfies the condition of Lemma 1.

This is implemented in our framework (see Fig. 6) by generating a C++ file
that performs this check.

Table 1, column Abs, gives the ratio of abstracted instructions for some
programs (when we have chosen to abstract away some instructions). For some
programs (matmult and jfdcint) the number of abstracted instructions is rather
high. This indicates that the control flow is quite simple and governed by a small
number of instructions.

Notice that this abstraction does not change the WCET of the program. 



3 From Programs to Games 

In this section we describe how to encode an assembly program into a game. The 
encoding can be applied to any assembly language but we give examples for the 
ARM9 processor. 

Given a program $p$, we define a two-player game to model the runs of $p_\bot$
defined in the previous section. Player 1 executes the instructions of $p_\bot$. The
role of Player 2 is to set the values of the status bits: when an instruction that
modifies them is encountered and some operands have unknown values, the result
is undetermined. The outcome is thus picked non-deterministically.

On the ARM9 processor, there are 4 status bits. A simple encoding would
be to have 4 boolean variables to model the value of each bit. As we let Player 2
choose the outcome, this corresponds to choosing four values for Player 2: N
(negative), Z (zero), V (overflow) and C (carry). This could create $2^4 = 16$
different next states and thus as many new potential branches in the game.
Most of the time, it is not necessary to know the actual values of the 4 status
bits. For instance the result of a comparison instruction cmp r0, r1 with, say, r1
unknown, could be used later on only to check whether r0 = r1. In this case the
value of the Z status bit is required but the values of the other status bits are
irrelevant.
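The effect of restricting Player 2 to the predicates that are actually tested later can be illustrated with a small sketch. The bit-mask encoding and the names P_LE, P_LT, P_LS, P_EQ and player2_branches are assumptions made for this example; the count is an upper bound, since it treats the predicates as independent.

```c
#include <assert.h>

/* Assumption: the set of predicates tested after a comparison is
   encoded as a bit mask, one bit per predicate. */
enum { P_LE = 1, P_LT = 2, P_LS = 4, P_EQ = 8 };

/* Number of outcomes Player 2 has to choose from when every predicate
   in the mask is chosen independently: 2^(number of predicates). */
unsigned player2_branches(unsigned flags_mask) {
    assert(flags_mask <= 0xFu);
    unsigned n = 1;
    for (unsigned m = flags_mask; m != 0; m >>= 1)
        if (m & 1u) n *= 2;
    return n;
}
```

With all four predicates relevant this gives 16 branches; with a single relevant predicate it gives 2.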

To reduce the number of branches (choices of Player 2) in the game, we
determine, for each instruction $\iota$ that sets a status bit, the next instructions
that depend on the result of $\iota$. This can be computed on the program $p$. For
each instruction $\iota$ that sets a status bit, we let $flags(\iota)$ be the set of predicates
used after $\iota$. For instance, in the example code of the listing in Fig. 1 (page 10),
the result of the instruction cmp r2, r0 at line 4 is used at line 14, and the only
predicate needed is gt (i.e., whether r2 > r0). In the worst case we still need 4
variables to encode the outcome of an instruction $\iota$ that sets the status bits, but
we reduce the choices of Player 2 to the predicates in $flags(\iota)$. In the previous
example, instead of having 16 branches, there will be only 2.

To model program $p_\bot$ in UPPAAL we need:

— an array, val, of 16 variables for the registers of the ARM9 processor;

— 4 boolean variables for the status bits (we use cmple, cmplt, cmpls, cmpeq
instead of the actual status bits N, Z, V and C, but this is equivalent);

— a stack of size K (the size of which has been determined in a previous stage).

Although the model-checker UPPAAL that we use is extremely efficient, we
have to be careful when encoding $p_\bot$: some information can be encoded using
variables, but they will be part of the state of the network of TA we build, and
will be encoded in the BDD representation of each state. Some information is
not dynamic but rather static (e.g., the type of an instruction $\iota$, or the registers
read/written $regR(\iota)$ and $regW(\iota)$) and can be encoded using UPPAAL functions.
This saves space as functions are not part of the encoding of a state. Given a
program $p_\bot$, we define the functions:

— $SetStatusB : p \to \mathbb{B}$ which, given an instruction $\iota \in p$, returns true if $\iota$ sets
some status bits (comparison instructions cmp, tst and instructions with the
"s" flag like subs, adds, etc.);

— $cmpU : p \times V_\bot \to \mathbb{B}$ which returns true if the result of the instruction $\iota$ in
state $v$ is unknown.

As a shorthand we write $NDcmp(\iota, v) = SetStatusB(\iota) \wedge cmpU(\iota, v)$ and this
indicates whether instruction $\iota$, when executed from state $v$, should be played
by Player 2 (the status bits should be set but an operand is unknown).

In addition to this, we define another function $update : V_\bot \to V_\bot$ which
updates the values of the registers and the status bits if required: this function
encodes the semantics of each instruction on the extended domain.



The results for the Fibonacci program of Listing 1.4 (page 15) are given in
Listings 1.5 and 1.6. These listings call for some comments:

— Listing 1.4 contains the assembly code generated by objdump after compiling
the C program with gcc; the instructions that set status bits have been
annotated (e.g., line 4, / le /) with the predicates that should be set by the
instruction (le in this case, for the instructions at lines 4, 20 and 24).

— Listing 1.5 contains the functions that determine whether the result of an
instruction that sets the status bits is undetermined. UNKNOWN is a special
value: we use an integer that is never used as an actual value in the content
of any register. For instance, if the value of r2 is unknown when executing
instruction (hexadecimal) 20 (decimal 32), cmpU returns true and SetStatusB as
well.

— Listing 1.6 contains the updates of the registers in the extended domain.
The updates of an instruction are performed only if it is not abstracted
away (is_abstracted function, not given here, but we can assume for now
that it always returns false). The instruction cmp r2,r0 (UPPAAL translation
lines 13 to 20) sets the cmple variable according to the values of r2 and r0.
If at least one of the values of r2 and r0 is unknown, the value of cmple will
be chosen right after the update step by Player 2, overriding the previous
value.

— The instruction cmp r2,r0 is unconditional, and it has to be scheduled for
execution. This is carried out by function SET(-,-,-) which sets 3 values
(in the first stage of the pipeline, see Section 4): the label of the instruction
(4), the memory addresses referenced by the instruction (-1 indicates no
memory addresses), and whether the instruction is scheduled or not (1 in this
case).

— For conditional instructions, e.g., movgt pc, lr (UPPAAL translation lines 24
to 37), if the function gt() returns false, the instruction is not scheduled
(SET(20,-1,0)). Function gt() returns the complement of the value of cmple that
has been set by the comparison instruction (or by Player 2 if some operands
were unknown) before.

— The last parameter of SET(-,-,-) has no meaning for conditional branching
instructions as they are always scheduled. We use it to indicate whether the
condition evaluates to true or false. An example is instruction ble 18
(UPPAAL translation lines 76 to 83 in Listing 1.6). If the condition (function
le()) evaluates to true this parameter is true, and false otherwise. This
information is used to simulate pipeline flushes when a branch prediction is
wrong.
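The helper functions gt(), le() and SET(-,-,-) are not shown in the listings. A minimal sketch of how they could look is given below; the record fetch_slot and the function schedule_movgt are hypothetical and only serve to make the example self-contained.

```c
#include <assert.h>
#include <stdbool.h>

static bool cmple;   /* predicate "<=" set by the last comparison */

bool le(void) { return cmple; }
bool gt(void) { return !cmple; }   /* complement of cmple */

/* Hypothetical record filled by SET(label, address, flag) for the
   first stage of the pipeline. */
struct fetch_slot { int label; int addr; int todo; };
static struct fetch_slot slot;

void SET(int label, int addr, int flag) {
    assert(flag == 0 || flag == 1);
    slot.label = label;   /* label of the instruction */
    slot.addr = addr;     /* referenced address, -1 if none */
    slot.todo = flag;     /* scheduled (or, for branches, taken) */
}

/* movgt pc, lr at 0x14: scheduled only when gt() holds. */
void schedule_movgt(void) {
    if (gt()) SET(20, -1, 1);
    else      SET(20, -1, 0);
}
```

When cmple is true, gt() is false and the conditional move enters the pipeline as a non-scheduled instruction.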



00000000 <fib>:
   0:  e3a02002  mov    r2, #2        ; 0x2
   4:  e1520000  cmp    r2, r0        / le /
   8:  e1a0c000  mov    ip, r0
   c:  e3a00001  mov    r0, #1        ; 0x1
  10:  e3a01000  mov    r1, #0        ; 0x0
  14:  c1a0f00e  movgt  pc, lr
  18:  e2822001  add    r2, r2, #1    ; 0x1
  1c:  e1a03000  mov    r3, r0
  20:  e352001e  cmp    r2, #30       ; 0x1e  / le /
  24:  d152000c  cmple  r2, ip        / le /
  28:  e0800001  add    r0, r0, r1
  2c:  e1a01003  mov    r1, r3
  30:  dafffff8  ble    18 <fib+0x18>
  34:  e1a0f00e  mov    pc, lr

00000038 <main>:
  38:  e1a0c00d  mov    ip, sp
  3c:  e92dd810  stmdb  sp!, {r4, fp, ip, lr, pc}
  40:  e3a0401e  mov    r4, #30       ; 0x1e
  44:  e24cb004  sub    fp, ip, #4    ; 0x4
  48:  e1a00004  mov    r0, r4
  4c:  ebffffeb  bl     0 <fib>
  50:  e1a00004  mov    r0, r4
  54:  e91ba810  ldmdb  fp, {r4, fp, sp, pc}

Listing 1.4. Complete Assembly Code 



/* function to determine whether status bits should be set */
bool SetStatusB(int i) { // i is the PC of instruction; function that tells whether status bits should be set
  // comparisons for function fib
  if (i==4) { // setting status bits for instruction cmp at 4 [0x4]
    return true;
  }
  if (i==32) { // setting status bits for instruction cmp at 32 [0x20]
    return true;
  }
  if (i==36) { // setting status bits for instruction cmp at 36 [0x24]
    return true;
  }
  // comparisons for function main
  return false;
}

/* comparisons for instructions used in the program */
bool cmpU(int i) {
  /* comparisons for function fib starting 0 ending 52 */
  if (i==4) return val[r2]==UNKNOWN || val[r0]==UNKNOWN; // [0x4]
  if (i==32) return val[r2]==UNKNOWN; // [0x20]
  if (i==36) return val[r2]==UNKNOWN || val[ip]==UNKNOWN; // [0x24]
  /* comparisons for function main starting 56 ending 84 */
  return false; // none if not found
} // end comp of instruction

bool NDcmp(int i) {
  return SetStatusB(i) && cmpU(i);
}

/* setcmp for instructions used in the program */
void setcmp(int i, bool n1, bool n2) {
  /* setcmp for function fib starting 0 ending 52 */
  if (i==4) { // instruction cmp r2, r0 at 4 [0x4]
    cmple=n1;
  }
  if (i==32) { // instruction cmp r2, #30 at 32 [0x20]
    cmple=n1;
  }
  if (i==36) { // instruction cmple r2, ip at 36 [0x24]
    cmple=n1;
  }
  /* setcmp for function main starting 56 ending 84 */
} // end setcmp of instruction


Listing 1.5. C Code for SetStatusB and cmpU 



1: void update() { // update function
2:   int nextpc, nextfp, tmp;
3:   /*
4:     updates for function fib starting 0 ending 52
5:   */
6:   if (val[pc]==0) { // Instruction mov r2, #2 at 0x0
7:     nextpc=val[pc]+4;
8:     if (!is_abstracted(val[pc])) { // effect of instruction is null if abstracted
9:       val[r2]=(2);
10:    }
11:    SET(0,-1,1); // instruction scheduled is 0, no memory access and scheduled
12:  } // end mov at 0x0
13:  if (val[pc]==4) { // Instruction cmp r2, r0 at 0x4
14:    nextpc=val[pc]+4;
15:    if (!is_abstracted(val[pc])) { // effect of instruction is null if abstracted
16:      // Should set the Z and N and C bits
17:      if ((val[r2]-(val[r0]))<=0) cmple=1; else cmple=0;
18:    }
19:    SET(4,-1,1); // instruction scheduled is 4, no memory access and scheduled
20:  } // end cmp at 0x4
21:
22:

  if (val[pc]==20) { // Instruction movgt pc, lr at 0x14
    nextpc=val[pc]+4;
    if (gt()) {
      if (!is_abstracted(val[pc])) { // effect of instruction is null if abstracted
        if (val[lr]==UNKNOWN) {
          val[pc]=UNKNOWN;
        }
        else {
          nextpc=(val[lr]);
        }
      }
      SET(20,-1,1); // instruction scheduled is 20, no memory access and scheduled
    }
    else SET(20,-1,0); // instruction not scheduled, no mem access
  } // end movgt at 0x14
  if (val[pc]==24) { // Instruction add r2, r2, #1 at 0x18
    nextpc=val[pc]+4;
    if (!is_abstracted(val[pc])) { // effect of instruction is null if abstracted
      if (val[r2]==UNKNOWN) {
        val[r2]=UNKNOWN;
      }
      else {
        val[r2]=(val[r2]+1);
      }
    }
    SET(24,-1,1); // instruction scheduled is 24, no memory access and scheduled
  } // end add at 0x18

  if (val[pc]==32) { // Instruction cmp r2, #30 at 0x20
    nextpc=val[pc]+4;
    if (!is_abstracted(val[pc])) { // effect of instruction is null if abstracted
      // Should set the Z and N and C bits
      if ((val[r2]-(30))<=0) cmple=1; else cmple=0;
    }
    SET(32,-1,1); // instruction scheduled is 32, no memory access and scheduled
  } // end cmp at 0x20
  if (val[pc]==36) { // Instruction cmple r2, ip at 0x24
    nextpc=val[pc]+4;
    if (le()) {
      if (!is_abstracted(val[pc])) { // effect of instruction is null if abstracted
        // Should set the Z and N and C bits
        if ((val[r2]-(val[ip]))<=0) cmple=1; else cmple=0;
      }
      SET(36,-1,1); // instruction scheduled is 36, no memory access and scheduled
    }
    else SET(36,-1,0); // instruction not scheduled, no mem access
  } // end cmple at 0x24

  if (val[pc]==48 && (!le())) { // Instruction ble 18, at 0x30
    nextpc=val[pc]+4;
    SET(48,-1,0); // instruction scheduled, no mem access, no branching
  } // end ble at 0x30 [cond false]
  if (val[pc]==48 && le()) { // Instruction ble 18, at 0x30
    nextpc=24; // to 0x18
    SET(48,-1,1); // instruction scheduled, no mem access, branching
  } // end ble at 0x30 [cond true]
  if (val[pc]==52) { // Instruction mov pc, lr at 0x34
    nextpc=val[pc]+4;
    if (!is_abstracted(val[pc])) { // effect of instruction is null if abstracted
      if (val[lr]==UNKNOWN) {
        val[pc]=UNKNOWN;
      }
      else {
        nextpc=(val[lr]);
      }
    }
    SET(52,-1,1); // instruction scheduled is 52, no memory access and scheduled
  } // end mov at 0x34

  /*
    end of updates for function fib
  */

101:  /*
102:    updates for function main starting 56 ending 84
103:  */
104:  if (val[pc]==56) { // Instruction mov ip, sp at 0x38
105:    nextpc=val[pc]+4;
106:    if (!is_abstracted(val[pc])) { // effect of instruction is null if abstracted
107:      if (val[sp]==UNKNOWN) {
108:        val[ip]=UNKNOWN;
109:      }
110:      else {
111:        val[ip]=(val[sp]);
112:      }
113:    }
114:    SET(56,-1,1); // instruction scheduled is 56, no memory access and scheduled
115:  } // end mov at 0x38
116:  if (val[pc]==60) { // Instruction stmdb sp!, {r4, fp, ip, lr, pc} at 0x3c
117:    nextpc=val[pc]+4;
118:    // push should first decrease val[pc] and then store in stack (val[pc])
119:    push(val[pc]);
120:    push(val[lr]);
121:    push(val[ip]);
122:    push(val[fp]);
123:    push(val[r4]);
124:    SET(60,-1,1); // instruction scheduled is 60, no memory access
125:  } // end stmdb at 0x3c
126:
127:
128:
129:  if (val[pc]==76) { // Instruction bl 0, (unconditional) at 0x4c
130:    nextpc=0; // to 0x0
131:    val[lr]=80;
132:    SET(76,-1,1); // instruction scheduled, no mem access, branching
133:  } // end bl at 0x4c
134:  if (val[pc]==80) { // Instruction mov r0, r4 at 0x50
135:    nextpc=val[pc]+4;
136:    if (!is_abstracted(val[pc])) { // effect of instruction is null if abstracted
137:      if (val[r4]==UNKNOWN) {
138:        val[r0]=UNKNOWN;
139:      }
140:      else {
141:        val[r0]=(val[r4]);
142:      }
143:    }
144:    SET(80,-1,1); // instruction scheduled is 80, no memory access and scheduled
145:  } // end mov at 0x50
146:  if (val[pc]==84) { // Instruction ldmdb fp, {r4, fp, sp, pc} at 0x54
147:    nextpc=val[pc]+4;
148:    nextpc=stack(val[fp]-4);
149:    val[sp]=stack(val[fp]-8);
150:    nextfp=stack(val[fp]-12);
151:    val[r4]=stack(val[fp]-16);
152:    val[fp]=nextfp;
153:    SET(84,-1,1); // instruction scheduled is 84, no memory access
154:  } // end ldmdb at 0x54
155:
156:  /*
157:    end of updates for function main
158:  */
159:
160:  val[pc]=nextpc;
161: } // end update



Listing 1.6. C Program 

The generic automaton to simulate a program $p_\bot$ is given in Fig. 2. We
assume that the main function of the program $p_\bot$ is called by another program
and a particular value INIT_LR gives the return point. The automaton Prog
performs some initialization (init_val()) and then computes the next state
until the end of the program is reached: this is when the value of the pc register
is equal to the return point INIT_LR (guard val[pc]==INIT_LR). To simulate each
instruction, the automaton Prog performs the following steps:

— feed the current instruction $\iota$ to the first stage of the pipeline when it is empty
(to do so it has to synchronize with the first stage of the pipeline, on the
fetch! channel) and compute the next state (update() function). This also
sets the next value of register pc. The result of update() is that the number
of the current instruction is stored into the variable pPC[FETCH_STAGE] where
FETCH_STAGE is the number of the first stage of the pipeline (0);

— if the instruction $\iota$ in pPC[FETCH_STAGE] is an undetermined comparison
(NDcmp(pPC[FETCH_STAGE]) evaluates to true), the upper dashed transition
is taken: Player 2 chooses two values n and z and the predicates that must
be set (cmple, cmplt, etc.) are set by setcmp (Listing 1.5). If $\iota$ does not set
any flag or the outcome is determined by the current state (the operands are
all known), the middle transition is taken (Player 2 does not have to play).

[Figure: automaton Prog. Recoverable edge labels: initialize! with init_val();
guard val[pc]==INIT_LR with prog_completed!; guard !(val[pc]==INIT_LR) with
fetch! and update(); select n:int[0,1], z:int[0,1] with guard
NDcmp(pPC[FETCH_STAGE]) and update setcmp(pPC[FETCH_STAGE],n,z).]

Fig. 2. Generic Automaton Prog to Simulate a Program



4 Model of the Hardware 

In this section we give a UPPAAL model for the architecture of the pipelined 
processor ARM9 and for the caches. 



4.1 Model of the Pipeline 

Each stage of the pipeline contains an instruction (and some other information).
The information for each stage of the pipeline is stored in arrays: pPC[k]
gives the number of the instruction in stage k; Todo[k] is a boolean value and
indicates whether the instruction pPC[k] is scheduled (some instructions are
conditional and are skipped); dataAdr[k] contains the address of the memory
cell referenced by instruction pPC[k] (-1 if none). For multiple loads and stores,
this should be a range of addresses; this information is used only for determining
whether a stall should occur in the pipeline. For multiple loads and stores, we
force a stall in the pipeline until the end of the multiple loads/stores
instruction. This is a safe encoding as the ARM9 does not exhibit timing
anomalies. There are 5 stages in the pipeline of the ARM9:

— stage 1: this is the fetch stage. It fetches the next instruction (pointed to
by the pc register) from the cache (or main memory) and this instruction
becomes the current instruction of stage 1;

— stage 2: decode stage. Decodes the instruction in stage 2;

— stage 3: execute stage. Carries out the computation (addition, comparisons,
etc.) of the instruction in stage 3;

— stage 4: memory stage. Carries out the transfers (from registers to main
memory or main memory to registers) of the instruction in stage 4;

— stage 5: writeback stage. Writes the value of registers that are ("writeback")
operands of the instruction in stage 5.

An instruction $\iota$ enters the pipeline at stage 1. It is transferred from stage i to
i + 1 as soon as possible. When it exits stage 5, it is completed. The execution
of a program is completed when its last instruction is completed.

Pipeline Stalls. The goal of pipelining is to split the execution of an
instruction into different simple steps, the idea being that each step can be
carried out concurrently for different instructions: while stage 1 fetches the
next instruction $\iota_k$, stage 2 decodes instruction $\iota_{k-1}$, etc. It may happen that
the simple steps of some sequences of instructions cannot be carried out
concurrently. A pipeline stall is a situation when one stage i of the pipeline
cannot perform its computation because it has to wait for another stage j > i to
complete its computation. An example is when the execution of an instruction at
stage 3 (execute) has an operand which is set in stage 4 (memory).

   0: ldr r1, [r0]
   4: sub r2, r0, r1

   c: ldm r13, {r1,r2,r3}
  10: add r4, r3, #1

Listing 1.7. Stalls

The sequence of instructions of lines 0 and 4 will result in a pipeline stall at
stage 3 for instruction 4: when instruction 4 (r2 := r0 - r1) is ready to execute
at stage 3, it has to wait for instruction 0 to complete (at stage 4) because
instruction 0 loads the content of the memory cell pointed to by r0 into r1.
Thus instruction 4 stalls for one cycle at stage 3 (we assume that the content
of the memory cell was in the cache and that it takes one cycle to be fetched).
The situation for instructions c and 10 can even result in more than one cycle
delay. The ldm instruction (line c) is a multiple load instruction. It loads the
registers r1, r2 and r3 with the contents of memory cells pointed to by r13.
Stage 4 performs the loads, but only one per cycle. Thus instruction 10 stalls
for 3 cycles at stage 3.

A pipeline stall may occur depending on: (i) the type of the instruction at
stage 3, and the type of the instruction at stages 4 and 5; (ii) the registers (and
memory addresses) used by the instructions at the corresponding stages.

Branch Prediction. When a conditional branch instruction enters the pipeline,
the next instruction to flow in is determined by the truth value of the condition.
This value might not yet be available when the branch instruction is in the first
stage of the pipeline. If the condition is determined by the value of a variable
which is not in the cache, it might take a few cycles before the result becomes
available. In this case, we should stall until the outcome of the comparison is
computed. This might however be inefficient.

Some heuristics can be applied to guess the most plausible next instruction
after a conditional branching. After the prediction, the chosen instruction flows
into the pipeline. If the guess was right, the result is the shortest execution
time for this part of the program. If the guess was wrong, the computations of
the mistakenly taken branch have to be undone and the pipeline flushed, which
results in a longer execution time. We do not discuss here the choice of a good
heuristic, but there are a few options that give good results on average.

In our model we follow [20] and model the heuristics for branch prediction
by: in a conditional branch, a branch is never taken (other heuristics can be
accommodated in our model).
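Returning to the stall rule of the Pipeline Stalls paragraph, the two examples of Listing 1.7 can be reproduced with a short sketch. The register bit-mask encoding, the type instr and the function stall_cycles are assumptions made for this illustration; the sketch only covers a load in the memory stage followed by a dependent instruction in the execute stage, with one load per cycle and every load hitting the cache.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of an instruction for hazard detection:
   registers are encoded as bit masks (bit k stands for rk). */
struct instr {
    bool is_load;      /* ldr or ldm */
    unsigned writes;   /* registers written */
    unsigned reads;    /* registers read */
};

static int popcount16(unsigned m) {
    int n = 0;
    for (m &= 0xFFFFu; m != 0; m >>= 1) n += (int)(m & 1u);
    return n;
}

/* Cycles the instruction `cur` at the execute stage has to wait for
   the load `prev` located in the memory stage: one cycle per loaded
   register if `cur` reads one of them, zero otherwise. */
int stall_cycles(const struct instr *prev, const struct instr *cur) {
    assert(prev && cur);
    if (!prev->is_load) return 0;
    if ((prev->writes & cur->reads) == 0) return 0;
    return popcount16(prev->writes);
}
```

For the pair ldr/sub this yields one cycle, and for the pair ldm/add three cycles, as in the discussion of Listing 1.7. Waiting for the whole multiple load mirrors the forced stall used in our encoding.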

UPPAAL Pipeline Model. The timed automata models we introduce are
close to the ones proposed in [20]. However there are some differences as we do
not have the same model for the program.

The timed automata for each stage (ARM9, 5 stages) are depicted in Fig. 3
and Fig. 4. The stage modelled by each automaton can be inferred from the
synchronization channel from the initial state (e.g., decode?). The first stage of the
pipeline is of particular importance as it models the case of a wrong guess in a
branch prediction. The automaton of Fig. 3 models the following behaviour:

1. the automaton accepts a fetch? synchronization when it is idle;

2. after accepting an instruction (fetch? synchronizes with fetch! in the
automaton Prog of Fig. 2), it actually fetches the instruction from main
memory via the instruction cache (CacheReadStart[INSTR_CACHE]!, where
INSTR_CACHE is the ID of the instruction cache);

3. when the instruction has been read from the cache or main memory, there
are two options:

(a) the instruction $\iota$ to be processed is a conditional branch (condition
type_of(pPC[me])==G4c) and the variable Todo[me] indicates whether
the condition was evaluated to true or false. In case it is a conditional
branch and the condition was true, we simulate two "instruction
read from the cache" steps: indeed our branch prediction algorithm is
"never branch" and thus if it happened that we had to branch, we should
simulate a pipeline flush. As we do not execute the instructions in the
pipeline (but rather when we feed the first stage of the pipeline), this can
be modelled by reading the next two instructions (the "never branch"
prediction) without executing them, and then resuming the simulation
from the target address of the branch instruction.

(b) the instruction to be processed is not a conditional branch or the
condition was evaluated to false; in this case the prediction was right
and nothing has to be undone.

After an instruction has been fetched in the fetch stage, it is fed to the
next stage of the pipeline. This is modelled by the decode! synchronization
and the copy(me,me+1) transition. copy(me,me+1) copies the information
in pPC[me], Todo[me] and dataAdr[me] to the next stage me+1.

[Figure: fetch stage automaton. Recoverable edge labels: fetch?;
CacheReadStart[INSTR_CACHE]!; CacheReadEnd[INSTR_CACHE]?; guard
type_of(pPC[me])==G4c && Todo[me]; PC+=BLK_SIZE (twice, each followed by a
cache read); decode! with copy(me,me+1); fetch_completed!; prog_completed?.]

Fig. 3. Timed Automata Model of the ARM9 Pipeline
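The number of instruction reads simulated by the fetch stage under the "never branch" prediction can be summarised by a small sketch. The function name fetch_reads is hypothetical; the parameter taken plays the role of Todo[me] for a conditional branch.

```c
#include <assert.h>
#include <stdbool.h>

/* Number of "instruction read from the cache" steps simulated by the
   fetch stage for one instruction under the "never branch"
   prediction. */
int fetch_reads(bool is_cond_branch, bool taken) {
    int reads = 1;                  /* the instruction itself */
    if (is_cond_branch && taken)
        reads += 2;                 /* two wrongly fetched instructions */
    assert(reads == 1 || reads == 3);
    return reads;
}
```

Only a conditional branch whose condition evaluates to true pays for the two extra reads, which is how the pipeline flush is accounted for.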

The memory stage automaton is a bit more involved than the others as it 
has to take into account different options: if the instruction is a memory transfer 
(type_of (pPC[me-l] )==G2LDR or type_of (pPC [me-1] )==G2STR) and is sched- 
uled (Todo[me-l] is true) a synchronization with the data cache is requested. 

The type of the instructions is given by a UPPAAL function type_of. The
duration is also given by a function dur() (used in the execute stage).



4.2 Model of the Caches 



A cache is a fast memory device. It is characterized by its size K (usually in
Kbytes), the length of a cache line (B in bytes) and the number of cache lines,
i.e., the size of the cache divided by the length of a line.

The main memory of a computer is divided into blocks equal to the length
of the cache line. We let $M = \{m_0, m_1, \cdots, m_n\}$ be the set of these blocks.

[Figure: memory stage automaton. Recoverable edge labels: memory? with t=0,
guarded either by Todo[me-1] && (type_of(pPC[me-1])==G2LDR ||
type_of(pPC[me-1])==G2STR) or by its negation; guard type_of(pPC[me])==G2LDR
with CacheReadStart[DATA_CACHE] and CD=dataAdr[me], then
CacheReadEnd[DATA_CACHE]?; guard type_of(pPC[me])==G2STR with
CacheWriteStart[DATA_CACHE] and CD=dataAdr[me], then
CacheWriteEnd[DATA_CACHE]?; execute_completed?; memory_completed!.]

Fig. 4. Timed Automata Model of the ARM9 Pipeline

The associativity of a cache determines where a memory block can reside:

— fully associative: a block can be in any line;

— direct mapped: a block can be in one line;

— j-way: a block can be in j different lines; in this case the cache is partitioned
into different sets. Fully associative and direct mapped caches are particular
instances of j-way caches. The partition induced by the j-way cache is denoted
$\mathcal{P} = \{P_1, \ldots\}$.

The set of lines a memory block can reside in is given by a mapping $k : M \to \mathcal{P}$.
The replacement policy determines which block to evict from the cache when the
cache is full. The most common policies are:

— LRU: least recently used;

— FIFO: first-in first-out;

— alternate, mixed and even random policies are permitted but not easily
predictable.

Handling write requests is also a distinctive feature of a cache.

— handling write hits:

• write through: write cache and memory;

• write back: write cache; this needs a dirty bit, which is taken care of when
ejecting a line from the cache;

— handling write misses:

• write allocate: write memory and fetch into cache;

• write no allocate: write memory (no fetch).

In this paper we model a cache with FIFO replacement policy and assume write
allocate on a write miss.

UPPAAL Cache Model. The automaton modeling the behaviour of the cache 
(together with the model iof the main memory automaton) is given in Fig. [5] 

After performing some initializations (initCache(), which sets the initial state 
of the cache), it accepts either write or read requests. Depending on the request, 
and on whether a cache line is dirty or not, a number of memory transactions (PMT) 
are needed to fetch the content of memory cell m. These transactions are 
performed one after the other. When they are completed, the transfer from the 
cache to the register of the processor takes place and requires CACHE_SPEED 
time units. 
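The behaviour described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the UPPAAL model itself: the class FifoCache and the function access_time are hypothetical names, the cache is taken to be fully associative and write back, and a miss is assumed to cost one transaction for the fetch plus one more when the evicted line is dirty. The constants use the values quoted in the case studies (CACHE_SPEED=1, 10 cycles per memory transaction).

```python
from collections import OrderedDict

CACHE_SPEED = 1     # cache-to-register transfer, in processor cycles
MAINMEMTRANS = 10   # one main-memory transaction, in processor cycles


class FifoCache:
    """Fully associative FIFO cache with dirty bits (assumed write back, write allocate)."""

    def __init__(self, num_lines: int):
        self.num_lines = num_lines
        self.lines = OrderedDict()  # block -> dirty bit, kept in insertion (FIFO) order

    def access(self, block: int, is_write: bool) -> int:
        """Return the number of pending memory transactions (PMT) for this access."""
        if block in self.lines:
            # Hit: no memory traffic; a write marks the line dirty.
            # Assigning to an existing key does not change the FIFO order.
            self.lines[block] = self.lines[block] or is_write
            return 0
        pmt = 1  # assumed: one transaction to fetch the missing block
        if len(self.lines) == self.num_lines:
            _, dirty = self.lines.popitem(last=False)  # evict the oldest line
            if dirty:
                pmt += 1  # assumed: one extra transaction to write it back
        self.lines[block] = is_write
        return pmt


def access_time(cache: FifoCache, block: int, is_write: bool) -> int:
    """Transactions run one after the other, then the cache transfer takes place."""
    return cache.access(block, is_write) * MAINMEMTRANS + CACHE_SPEED


cache = FifoCache(num_lines=1)
print(access_time(cache, 0, True))   # 11: miss on an empty cache
print(access_time(cache, 0, False))  # 1: hit
print(access_time(cache, 1, False))  # 21: miss, dirty line written back first
```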

Fig. 5. Timed Automata Model for the Caches

5 Tool Chain and Case Studies 

We have applied the previous framework to a number of benchmarks from 
Mälardalen University. 

Tool Chain. The tool chain to compute the WCET is depicted in Fig. 6. The 
components we have developed are ARM2UPP and PATCH_UPP: 

— ARM2UPP takes as input a program in assembly (file.arm) that has been 
annotated with the comparison operators for each instruction that sets a 
status bit. It generates four files: 

• file.{xml,q}, which contain respectively the UPPAAL network of automata 
(and functions like update() etc.) modeling the execution of the program 
on the ARM9 architecture, and the UPPAAL queries to compute/check 
the WCET; 

• file-reach is an executable obtained by compiling file-reach.cpp; 
this latter file is a C++ program that simulates the program in file.arm. 
file-reach always terminates. However, early termination can be forced 
by passing some parameters (maximal number of states, maximal number 
of split cases). In case the number of split cases is too large (e.g., 2^° 
for Bubble Sort), it is possible to add some information in the file 
file-reach.cpp, like constraints on the outcome of an unknown comparison. 
This step may be iterated several times. When it is completed, 
the file file.info contains some useful information (like maximal stack 
size, etc.). 

• file-equiv is an executable obtained by compiling file-equiv.cpp; 
this program checks whether an abstraction mapping (which is given by 
a function) is valid or not (it implements the validity-checking algorithm 
described earlier in the paper). 

— PATCH_UPP modifies some constants in file.xml to incorporate the information 
from file.info (like stack size) and can also include the function of 
abstracted instructions (if it has been declared valid). 

UPPAAL-TiGA Queries. In order to compute the WCET of a program, we 
can check whether the program always terminates within k time units. The 
smallest such k can be computed using a binary search with UPPAAL. The 
drawback of this check is that some deadlocks may occur in the system, yielding 
a biased value of the WCET. 
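The binary search mentioned above can be sketched as follows. This is only an illustration: terminates_within is a hypothetical oracle standing in for one model-checker query ("does the program always terminate within k time units?"), no real UPPAAL interface is called, and the oracle is assumed to be monotone in k. Since no upper bound is known in advance, the sketch first doubles a candidate bound until the oracle answers yes.

```python
def wcet_by_bisection(terminates_within, initial_bound: int = 1) -> int:
    """Return the smallest k for which terminates_within(k) holds.

    terminates_within is a hypothetical, monotone oracle: once it holds
    for some k, it holds for every larger k.
    """
    # Phase 1: find some upper bound by doubling.
    hi = initial_bound
    while not terminates_within(hi):
        hi *= 2
    # Phase 2: bisect on [0, hi]; the invariant is that terminates_within(hi) holds.
    lo = 0
    while lo < hi:
        mid = (lo + hi) // 2
        if terminates_within(mid):
            hi = mid
        else:
            lo = mid + 1
    return hi


# Toy oracle: pretend the true WCET is 571 time units.
calls = []

def toy_oracle(k: int) -> bool:
    calls.append(k)
    return k >= 571

print(wcet_by_bisection(toy_oracle))  # 571
```

The number of oracle calls grows logarithmically with the WCET, which matters because each call corresponds to a full model-checking run.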

An alternative way of computing the WCET is to check a control property: 
"Can Player 1 enforce termination of the program and if yes, what is the best 
duration he can guarantee?" This optimal time reachability control objective 
can be checked in one query (see [22]) with UPPAAL-TiGA, provided we know 
an upper bound of the WCET. Such a bound can be roughly over-estimated from 
the program (we have not implemented this part yet). Optimal reachability of a 
location l is then specified by the control objective: 

control (#n,0) : A [ true U l ] 

if #n is a rough upper bound of the WCET. 

Program termination in the UPPAAL model happens when the location DONE 
is reached in the WriteBackStage automaton (last stage of the pipeline). Thus 
the control property we check is: 

control (#n,0) : A [ true U WriteBackStage.DONE ] 

If #n is not large enough, the UPPAAL-TiGA result will be "not controllable". 

Fig. 6. Tool Chain Overview 

Case Studies & Results. We have applied the framework described in Fig. 6 
to a number of benchmark programs from Mälardalen University. We could not 
analyse the full set of programs because of the current limitations of our tools: 

— floating point operations are not supported yet; 

— a few operators (e.g., ror) of the ARM9 assembly language are not supported 
yet. 

There are not many published results about the actual WCET of the benchmarks 
(or when there are, the hardware parameters, cache speed, etc. are not given). 
To evaluate the relevance of our method, we compare our results to the ones 
obtained with the METAMOC method [20]. 

There are 15 programs that can be analysed by METAMOC using a concrete 
instruction cache and an "always miss" data cache. Only 7 of the 15 programs 
can be analysed with both a concrete instruction cache and a concrete data cache. 
Using our encoding and tool chain, we could analyse 13 out of these 15 programs 
(two of them contain unsupported operations) with concrete caches. Moreover, 
the time/space needed to compute the results is very small compared to the 
resources used in METAMOC (32GB RAM computer). Table 1 gives the values of 
the WCET for each program, and the time for UPPAAL-TiGA to compute the result. 
The time needed to compute the intermediary files is negligible. The timing 
specifications of the caches are: CACHE_SPEED=1 (processor) cycle, i.e., the 
cache speed is the same as the processor speed, and a memory transaction takes 
10 processor cycles. The UPPAAL files are available from 
http://www.irccyn.fr/franck/wcet 



Energy/Power Consumption Optimization. The last column of Table 1 
gives the percentage of time the processor can run at a slower clock rate (1/4th 
of its fastest speed) without any impact on the WCET: this is due to the 
initial transient phase of the execution of a program, where instructions are 
loaded into the cache. For some small programs the result is impressive (22% 
for janne-complex). To obtain this result we just add an automaton to the 
network that switches the rate from 4 to 1 after a certain amount of time. 
Another interesting and easy computation is to fix the time the processor runs 
at a slower rate (in the initial phase) and compute the optimal time to reach 
the end of the program (which is the WCET) under this constraint. 
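As a rough arithmetic check, the sketch below recomputes the low-power percentage from a slow-phase duration and a WCET. It assumes that an entry of the form a/b% in the last column of Table 1 means "the processor can run at the slow rate for a time units, which is b% of the WCET"; this reading, and the function name, are assumptions made for illustration.

```python
def slow_fraction_percent(slow_time: int, wcet: int) -> float:
    """Share of the WCET (in percent) during which the slow clock rate can be used."""
    return 100.0 * slow_time / wcet


# Values taken from the Low Power and WCET columns of Table 1.
for name, slow_time, wcet in [("fac", 26, 1883), ("fib", 26, 571), ("janne-complex", 176, 792)]:
    print(f"{name}: {slow_fraction_percent(slow_time, wcet):.2f}%")
```

For janne-complex this gives a little over 22%, in line with the figure quoted in the text.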



Program          loc   Moves   UPPAAL-TiGA time/space   WCET     Abs       Low Power

Single-Path Programs
fac              26    -       0.35s/6.91MB             1883     4/34      26/1.3%
fib              74    -       0.25s/5.68MB             571      4/22      26/4.5%
janne-complex*   65    -       0.54s/7.76MB             792      0/23      176/22%
matmult*         162   -       119.2s/936.75MB          614827   31/107    800/0.001%
jfdcint          374   -       7.13s/55.99MB            49017    394/454   108/0.22%
expint(50,1)     81    -       6.08s/59.16MB            65042    0/124     70/1.7%
expint(50,21)    81    -       3.65s/43.21MB            41015    0/124     71/1.7%
fdct             238   -       2.83s/26.79MB            26099    0/286     90/0.3%
edn*             284   -       22.28s/230.98MB          62968    0/460     26/0.04%
recursion*       41    -       2.68s/28.82MB            10335    0/38      32/0.3%

Multiple-Paths Programs
bs               174   5       0.52s/6.52MB             366      0/22      30/8.2%
cnt*             115   100     100.25s/377.02MB         6483     0/82      40/0.06%
insertsort*      91    675     9.36s/81.27MB            27061    0/53      400/1.4%
ns*              497   625     12.38s/110.92MB          43239    0/41      32/0.0007%

loc: lines of code in the C source file. Moves: max number of Player 2 moves along a path. 
Abs: Abstracted Instr./Instr. *Program selected for the WCET Challenge 2006. 

Table 1. Results (C programs compiled with gcc -O2) 



6 Conclusion 

In this paper we have presented a framework based on timed games and the 
model checker UPPAAL-TiGA to compute the WCET for programs running on 
architectures featuring pipelines and caches. 

The results we have obtained support the claim that model checking is 
adequate for computing the WCET. Moreover, UPPAAL-TiGA could be tuned to 
handle WCET computation more efficiently: priorities between processes can reduce 
unnecessary interleavings, and they are not yet implemented in UPPAAL-TiGA 
(though they are in UPPAAL); a lot of time is spent checking whether a new 
state has already been encountered: this will never be the case in the programs 
we check (otherwise there would be an infinite loop). Disabling this check would 
also reduce the time to compute the results. Of course, a program like Bubble 
Sort remains beyond the scope of analysis within our framework. Nevertheless, 
what we advocate is the combination of different techniques to solve the WCET 
problem: abstract interpretation (AI) combined with Integer Linear Programming 
(ILP) has given very good results [11], but this method is yet to prove 
that: (1) it can be easily adapted to different processors and (2) it can take into 
account power related features (like change of speed of the processor). 

Our ongoing work focuses on two aspects: 

1. extend the set of instructions supported by our compiler and provide models 
for other architectures (like ARM11); 

2. add a pre-processing step to prune the execution tree of the program. The 
goal of this step is to reduce the number of paths of the program while still 
preserving the paths giving the WCET. This step can be carried out using 
ILP techniques, or counter-example guided abstraction refinement (CEGAR) 
methods pS] . 